Abstract In this paper we promote introducing software verification and control
flow graph similarity measurement into the automated evaluation of students' programs.
We present a new grading framework that merges the results obtained by a combination
of these two approaches with the results obtained by automated testing, leading
to improved quality and precision of automated grading. These two approaches are
also useful in providing comprehensible feedback that can help students
improve the quality of their programs. We also present our corresponding tools, which
are publicly available and open source. The tools are based on the LLVM low-level
intermediate code representation, so they could be applied to a number of
programming languages. Experimental evaluation of the proposed grading framework
is performed on a corpus of university students' programs written in the
programming language C. The results of the experiments show that automatically generated
grades are highly correlated with manually determined grades, suggesting that the
presented tools can find real-world applications in studying and grading.

1 Introduction

Automated evaluation of programs is beneficial for both teachers and students 
(Pears, Seidman, Malmi, Mannila, Adams, Bennedsen, Devlin, & Paterson, 2007). 
For teachers, automated evaluation is helpful in grading assignments and leaves
more time for other activities with students. For students, it provides immediate
feedback, which is very important in the process of studying, especially in computer
science, where students take on the challenge of making the computer follow their
intentions (Nipkow, 2012). Immediate feedback is particularly helpful in introductory
programming courses, where students have frequent and deep misconceptions
(Vujosevic-Janicic & Tosic, 2008).

The benefits of automated evaluation of programs are even more significant in the
context of online learning. A number of the world's leading universities offer numerous
online courses. The number of students taking such courses is measured in millions
and is growing quickly (Allen & Seaman, 2010). In online courses, the teaching
process is carried out on the computer and contact with the teacher is already minimal,
hence fast and substantial automatic feedback is especially desirable.
Therefore, automation of evaluation tasks in online learning is very important.

Most of the tools for automated evaluation of students' code are based on automated
testing (Douce, Livingstone, & Orwell, 2005). Testing is used for checking the
functional correctness of a student's solution, i.e., whether the student's program
exhibits the desired behavior on selected inputs. Testing can also be used for
detecting bugs. We consider bugs to be runtime errors and exclude errors that
only compromise functional correctness (for example, in the programming language
C, some important bugs are buffer overflow, null pointer dereferencing and division
by zero). Although there is a variety of software verification tools that could
enhance automated bug finding in students' programs (by analyzing the code without
executing it), these tools are usually too complex to use and cannot be easily
adapted for educational purposes.

In addition to checking functional correctness, an evaluation tool may also
analyze program efficiency and/or complexity by profiling. Other relevant aspects of
program quality are its design and modularity (adequate decomposition of
code into functions). These issues are addressed by checking similarity to a
teacher-provided solution. In order to check similarity, aspects that can be analyzed are:
frequencies of keywords, number of lines of code, number of variables, etc. Recently,
a more sophisticated approach of grading students' programs by measuring the
similarity of related graphs has been proposed (Wang, Su, Wang, & Ma, 2007;
Naude, Greyling, & Vogts, 2010). Recent surveys of related approaches are given
elsewhere (Ala-Mutka, 2005; Ihantola, Ahoniemi, Karavirta, & Seppala, 2010).

In this paper, we propose a new grading framework for automated evaluation
of students' programs, aimed primarily at introductory programming courses. The
framework is based on merging information from three different evaluation methods:
it merges results obtained by software verification (automated bug finding)
and control flow graph (CFG) similarity measurement with results obtained by
automated testing. The synergy between automated testing, verification, and
similarity measurement improves the quality and precision of automated grading and
overcomes the individual weaknesses of these approaches. Our experimental results
show that our framework can lead to a grading model that correlates highly
with manual grading and is therefore promising for real-world applicability in
education.

We also briefly discuss the tools for software verification (Vujosevic-Janicic &
Kuncak, 2012) and CFG similarity (Nikolic, 2013) that we use for assignment
evaluation. These tools, based on novel methods, are publicly available and open source. 1
Both tools use the low-level intermediate code representation LLVM. Therefore,
they could be applied to a number of programming languages and could be
complemented with other existing LLVM-based tools (e.g., tools for automated test
generation). Also, the tools are enhanced with support for meaningful and
comprehensible feedback to students, so they can be used both in the process of studying
and in the process of grading assignments.

Overview of the paper. Necessary background information is given in Section 2. 
Motivating examples for the synergy of the three proposed approaches are given 
in Section 3. The grading setting and the corpus used for evaluation are described 
in Section 4. The role of the verification techniques in automated evaluation is 
discussed in Section 5 and the role of structural similarity measurement is discussed 
in Section 6. An experimental evaluation of the proposed framework for automated 
grading is presented in Section 7. Section 8 contains information about related 
work. Conclusions and outlines of possible directions of future work are given in 
Section 9. 



2 Background 

This section provides an overview of intermediate languages, the LLVM tool, soft- 
ware verification, the LAV tool, control flow graphs and graph similarity measure- 
ment. 

Intermediate languages and LLVM. An intermediate language separates the concepts
and semantics of a high-level programming language from low-level issues relevant
for a specific machine. Examples of intermediate languages include the ones used in
LLVM and the .NET framework. LLVM 2 is an open source, widely used, rich compiler
framework, well suited for developing new mid-level language-independent analyses
and optimizations of all sorts (Lattner & Adve, 2002). The LLVM intermediate
language is an assembly-like language with simple RISC-like instructions. It provides
easy construction of control flow graphs of program functions and of entire
programs. There are a number of tools using LLVM for various purposes, including
software verification. LLVM has front ends for C, C++, Ada and Fortran, while
there are external projects for translating a number of other languages to the LLVM
intermediate representation (e.g., Python, Ruby, Haskell, Java, D, PHP, Pure, and
Lua).

Software verification and LAV. Verification of software and automated bug finding
are among the greatest challenges in computer science. Software bugs cost the
world economy billions of dollars annually (Tassey, 2002). Software verification
tools aim at automatically checking correctness properties. Different approaches
to automated checking of software properties exist, such as symbolic execution
(King, 1976), model checking (Clarke, 2008) and abstract interpretation (Cousot &
Cousot, 1977). Software verification tools usually use automated theorem provers.

1 http://argo.matf.bg.ac.rs/?content=lav

2 http://llvm.org/

LAV (Vujosevic-Janicic & Kuncak, 2012) is an open-source tool for statically
verifying program assertions and locating bugs such as buffer overflows, pointer
errors and division by zero. LAV uses the popular LLVM infrastructure. As a result,
it supports several programming languages that compile into LLVM, and benefits
from the robust LLVM front ends. LAV is primarily aimed at programs in the C
programming language, in which the opportunities for errors are abundant. For
each safety-critical command, LAV generates a first-order logic formula that
represents its correctness condition. This formula is checked by one of the several SMT
solvers (Barrett, Sebastiani, Seshia, & Tinelli, 2009) used by LAV. If a command
cannot be proved to be safe, LAV translates a potential counterexample from the
solver into a program trace that exhibits this error. It also extracts the values of
relevant program variables along this trace. LAV has already been used, to a limited
extent, for automated bug finding in students' assignments (Vujosevic-Janicic &
Kuncak, 2012).

Control flow graph. A control flow graph (CFG) is a graph-based representation of
all paths that might be traversed through a program during its execution. Each
node of a CFG represents a sequence of commands containing only one path of
execution (there are no jumps, loops, conditional statements, etc.). Control
flow graphs can be produced by various tools, including LLVM. A control flow
graph clearly separates the structure of the program from its contents. Therefore,
it is a suitable representation for structural comparison of programs.

Graph similarity and neighbor matching method. There are many similarity measures
for graphs and their nodes (Kleinberg, 1999; Heymans & Singh, 2003; Blondel,
Gajardo, Heymans, Senellart, & van Dooren, 2004; Nikolic, 2013). These measures
have been successfully applied in several practical domains such as ranking of query
results, synonym extraction, database structure matching, construction of
phylogenetic trees, analysis of social networks, etc. A short overview of similarity measures
for graphs can be found in the literature (Nikolic, 2013).

A specific similarity measure for graph nodes, called neighbor matching, possesses
properties relevant for our purpose that other similar measures lack (Nikolic, 2013).
It allows a similarity measure for graphs to be defined based on the similarity scores of
their nodes. The notion of similarity of nodes is based on the intuition that two
nodes i and j of graphs A and B are considered to be similar if the neighbor nodes of i can
be matched to similar neighbor nodes of j. More detailed definitions follow.

In the neighbor matching method, if a graph contains an edge (i, j), the node 
i is called an in-neighbor of node j in the graph and the node j is called an out- 
neighbor of the node i in the graph. An in-degree id(i) of the node i is the number 
of in-neighbors of i, and an out-degree od(i) of the node i is the number of out- 
neighbors of i. 
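
As a small illustration of these definitions, the following C sketch computes id(i) and od(i) from an adjacency matrix. The constant MAX_NODES, the function names and the adjacency-matrix representation are assumptions made for this example only; they are not taken from the neighbor matching method or from any tool discussed here.

```c
#include <assert.h>

#define MAX_NODES 16   /* illustrative bound on the number of nodes */

/* adj[u][v] != 0 means that the graph contains the edge (u, v). */

/* in-degree id(v): the number of nodes u such that (u, v) is an edge */
int in_degree(int adj[MAX_NODES][MAX_NODES], int n, int v)
{
    assert(n <= MAX_NODES && v >= 0 && v < n);
    int count = 0;
    for (int u = 0; u < n; u++)
        if (adj[u][v])
            count++;
    return count;
}

/* out-degree od(v): the number of nodes u such that (v, u) is an edge */
int out_degree(int adj[MAX_NODES][MAX_NODES], int n, int v)
{
    assert(n <= MAX_NODES && v >= 0 && v < n);
    int count = 0;
    for (int u = 0; u < n; u++)
        if (adj[v][u])
            count++;
    return count;
}
```

For instance, for the diamond-shaped graph with edges (0, 1), (0, 2), (1, 3) and (2, 3), which has the shape of the CFG of a single if-else statement, node 0 has out-degree 2 and node 3 has in-degree 2.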

If A and B are two finite sets of arbitrary elements, a matching of elements of
sets A and B is a set of pairs $M = \{(i, j) \mid i \in A, j \in B\}$ such that no element of
one set is paired with more than one element of the other set. For the matching
M, enumeration functions $f : \{1, 2, \ldots, k\} \to A$ and $g : \{1, 2, \ldots, k\} \to B$ are defined
such that $M = \{(f(l), g(l)) \mid l = 1, 2, \ldots, k\}$ where $k = |M|$. If $w(a, b)$ is a function
assigning weights to pairs of elements $a \in A$ and $b \in B$, the weight of a matching
is the sum of weights assigned to the pairs of elements from the matching. The
goal of the assignment problem is to find a matching of elements of A and B of
the highest weight (if the two sets are of different cardinalities, some elements of the
larger set will not have corresponding elements in the smaller set). The assignment
problem is usually solved by the well-known Hungarian algorithm of complexity
$O(mn^2)$ where $m = \max(|A|, |B|)$ and $n = \min(|A|, |B|)$ (Kuhn, 1955), but there
are also more efficient algorithms.

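The calculation of the similarity of nodes i and j, denoted $x_{ij}$, is based on an iterative
procedure given by the following equations:

$$x^{k+1}_{ij} \leftarrow \frac{s^{k+1}_{in}(i,j) + s^{k+1}_{out}(i,j)}{2} \qquad (1)$$

where

$$s^{k+1}_{in}(i,j) \leftarrow \frac{1}{m_{in}} \sum_{l=1}^{n_{in}} x^{k}_{f^{in}_{ij}(l)\, g^{in}_{ij}(l)} \qquad s^{k+1}_{out}(i,j) \leftarrow \frac{1}{m_{out}} \sum_{l=1}^{n_{out}} x^{k}_{f^{out}_{ij}(l)\, g^{out}_{ij}(l)}$$

$$m_{in} = \max(id(i), id(j)) \qquad m_{out} = \max(od(i), od(j))$$

$$n_{in} = \min(id(i), id(j)) \qquad n_{out} = \min(od(i), od(j))$$

where the functions $f^{in}_{ij}$ and $g^{in}_{ij}$ are the enumeration functions of the optimal matching
of in-neighbors for nodes i and j with weight function $w(a, b) = x^{k}_{ab}$, and analogously
for $f^{out}_{ij}$ and $g^{out}_{ij}$. In Equations (1), $\frac{0}{0}$ is defined to be 1 (used in the case when
$m_{in} = n_{in} = 0$ or $m_{out} = n_{out} = 0$). Initial similarity values $x^{0}_{ij}$ are set to 1 for
each i and j. The termination condition is $\max_{ij} |x^{k}_{ij} - x^{k-1}_{ij}| < \varepsilon$ for some chosen
precision $\varepsilon$, and the iterative algorithm is proved to converge (Nikolic, 2013).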

The similarity matrix $[x_{ij}]$ reflects the similarities of the nodes of two graphs A
and B. The similarity of the graphs can be defined as the weight of the optimal
matching of nodes from A and B divided by the number of matched nodes (Nikolic,
2013).
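
The iterative procedure and the resulting graph similarity can be sketched in C as follows. This is only an illustration under stated assumptions: graphs are given as small adjacency matrices, all identifiers (MAXN, best_match, side, neighbors, neighbor_matching, graph_similarity) are hypothetical, and the optimal matching is found by exhaustive search instead of the Hungarian algorithm, which is practical only for very small neighbor sets. It is not the implementation used in the tool of Nikolic (2013).

```c
#include <assert.h>
#include <math.h>

#define MAXN 8   /* illustrative bound on the number of nodes per graph */

/* Maximum weight of a matching between a[r..na-1] and the elements of b
 * not yet marked in `used`, with weights x[a][b]. Exhaustive search. */
static double best_match(const int *a, int na, const int *b, int nb,
                         double x[MAXN][MAXN], int r, unsigned used)
{
    if (r == na)
        return 0.0;
    /* a[r] may stay unmatched (needed when na > nb) */
    double best = best_match(a, na, b, nb, x, r + 1, used);
    for (int c = 0; c < nb; c++) {
        if (used & (1u << c))
            continue;
        double w = x[a[r]][b[c]]
                 + best_match(a, na, b, nb, x, r + 1, used | (1u << c));
        if (w > best)
            best = w;
    }
    return best;
}

/* In- or out-part of the update: matching weight over the larger degree. */
static double side(const int *a, int na, const int *b, int nb,
                   double x[MAXN][MAXN])
{
    int m = na > nb ? na : nb;
    if (m == 0)
        return 1.0;                      /* 0/0 is defined to be 1 */
    return best_match(a, na, b, nb, x, 0, 0u) / m;
}

/* Stores the in-neighbors (incoming != 0) or out-neighbors of v in out. */
static int neighbors(int adj[MAXN][MAXN], int n, int v, int incoming, int *out)
{
    int cnt = 0;
    for (int u = 0; u < n; u++)
        if (incoming ? adj[u][v] : adj[v][u])
            out[cnt++] = u;
    return cnt;
}

/* Fills x with node similarities; stops when the largest change is < eps. */
void neighbor_matching(int A[MAXN][MAXN], int nA, int B[MAXN][MAXN], int nB,
                       double eps, double x[MAXN][MAXN])
{
    assert(nA > 0 && nB > 0 && nA <= MAXN && nB <= MAXN);
    double next[MAXN][MAXN] = {{0}};
    double diff;
    for (int i = 0; i < MAXN; i++)
        for (int j = 0; j < MAXN; j++)
            x[i][j] = 1.0;               /* initial similarity values */
    do {
        diff = 0.0;
        for (int i = 0; i < nA; i++)
            for (int j = 0; j < nB; j++) {
                int ia[MAXN], ib[MAXN], oa[MAXN], ob[MAXN];
                int nia = neighbors(A, nA, i, 1, ia);
                int nib = neighbors(B, nB, j, 1, ib);
                int noa = neighbors(A, nA, i, 0, oa);
                int nob = neighbors(B, nB, j, 0, ob);
                next[i][j] = (side(ia, nia, ib, nib, x) +
                              side(oa, noa, ob, nob, x)) / 2.0;
                double d = fabs(next[i][j] - x[i][j]);
                if (d > diff)
                    diff = d;
            }
        for (int i = 0; i < nA; i++)
            for (int j = 0; j < nB; j++)
                x[i][j] = next[i][j];
    } while (diff >= eps);
}

/* Weight of the optimal node matching divided by the number of matched nodes. */
double graph_similarity(int nA, int nB, double x[MAXN][MAXN])
{
    assert(nA > 0 && nB > 0 && nA <= MAXN && nB <= MAXN);
    int a[MAXN], b[MAXN];
    for (int i = 0; i < nA; i++) a[i] = i;
    for (int j = 0; j < nB; j++) b[j] = j;
    return best_match(a, nA, b, nB, x, 0, 0u) / (nA < nB ? nA : nB);
}
```

For two copies of the graph with the single edge (0, 1), the node similarity matrix produced by this sketch is the identity matrix and the graph similarity is 1. Replacing the exhaustive search with a polynomial assignment algorithm would be the natural choice for graphs of realistic size.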



3 The Need for Synergy of Testing, Verification, and Similarity 
Measurement 

Automated testing of programs is a very important part of the evaluation process.
Unfortunately, the grading is directly influenced by the choice of test cases.
Also, no matter whether the test cases are automatically generated or manually
designed, testing can guarantee neither the functional correctness of a program nor
the absence of bugs.

For checking functional correctness, a combination of random testing with
evaluator-supplied test cases is a common choice (Mandal, Mandal, & Reade, 2007).
However, randomly generated test cases are not likely to hit a bug if it exists
(Godefroid, Levin, & Molnar, 2012), while manually choosing all important test
cases is not a trivial job and can be time consuming. It is not sufficient that
test cases cover all important paths through the program. It is also important to
carefully choose the values of the variables for each path: for some values along the
same path a bug can be detected, while for other values the bug can stay
undetected.

Also, manually generated test cases are designed according to the expected
solutions, while the evaluator cannot predict all the important paths through the
student's solution. Even running a test case that hits a certain bug (for example,
a buffer overflow bug in a C program) does not necessarily lead to any visible
undesired behavior if the program is run in a normal (or sandbox) environment.
Finally, if one manages to hit a bug by a test case and the bug produces the
Segmentation fault message, this is not feedback that a student can easily understand
and use for debugging the program. In the context of automated grading, this
feedback cannot be easily used either, since it may have different causes. In contrast to
program testing, software verification tools like Pex (Tillmann & Halleux, 2008),
Klee (Cadar, Dunbar, & Engler, 2008), S2E (Chipounov, Kuznetsov, & Candea,
2011), CBMC (Clarke, Kroening, & Lerda, 2004), ESBMC (Cordeiro, Fischer, &
Marques-Silva, 2009), and LAV (Vujosevic-Janicic & Kuncak, 2012) can give much
better explanations (e.g., the kind of bug and the program trace that introduces
an error).



Left-hand side:

 1: #define max_size 50
 2: void matrix_maximum(int a[][max_size], int rows, int columns, int b[])
 3: {
 4:   int i, j, max=a[0][0];
 5:   for(i=0; i<rows; i++)
 6:   {
 7:     for(j=0; j<columns; j++)
 8:       if(max < a[i][j])
 9:         max = a[i][j];
10:     b[i] = max;
11:     max=a[i+1][0];
12:   }
13:   return;
14: }

Right-hand side:

  int i, j, max;
  for(i=0; i<rows; i++)
  {
    max = a[i][0];
    for(j=0; j<columns; j++)
      if(max < a[i][j])
        max = a[i][j];
    b[i] = max;
  }
  return;

Fig. 1 Buffer overflow in the code on the left-hand side cannot be discovered by simple testing.
A functionally equivalent solution without a bug is given on the right-hand side.



The example function shown in Figure 1 is extracted from a student's code
written on an exam. It calculates the maximum value of each row of a matrix and
writes these values into an array. This function is used in a context where the
memory for the matrix is statically allocated and the numbers of rows and columns
are less than or equal to the allocated sizes of the matrix. However, in line 11, there
is a possible buffer overflow bug, since i + 1 can exceed the allocated number of
rows of the matrix. It is possible that this kind of bug does not affect the output
of the program or destroy any data, but in a slightly different context it can be
harmful, so students should be warned and penalized for making such errors.
Bugs like this one can be missed in testing but are easily discovered by verification
tools like LAV.
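
To illustrate why testing alone can miss this bug, the following sketch runs the left-hand function of Figure 1 on a 2 x 2 matrix stored in a statically allocated 50 x 50 array. The test function row_maxima_test and its data are assumptions made for this example. Since the number of rows is smaller than max_size, the read of a[i+1][0] in the last pass still falls inside the allocated array and the test passes; with rows equal to max_size the same read would fall outside the array.

```c
#include <assert.h>

#define max_size 50

/* Left-hand function from Figure 1, containing the potential overflow. */
void matrix_maximum(int a[][max_size], int rows, int columns, int b[])
{
    int i, j, max = a[0][0];
    for (i = 0; i < rows; i++) {
        for (j = 0; j < columns; j++)
            if (max < a[i][j])
                max = a[i][j];
        b[i] = max;
        max = a[i + 1][0];   /* reads row number `rows` in the last pass */
    }
}

/* Hypothetical test case: returns 1 if the row maxima are as expected. */
int row_maxima_test(void)
{
    static int a[max_size][max_size];   /* zero-initialized 50 x 50 matrix */
    const int rows = 2, columns = 2;
    int b[2];

    assert(rows < max_size);            /* only the harmless case is exercised */
    a[0][0] = 3;  a[0][1] = 7;
    a[1][0] = -2; a[1][1] = -5;
    matrix_maximum(a, rows, columns, b);
    return b[0] == 7 && b[1] == -2;
}
```

The test succeeds although the function is not safe for all inputs permitted by its context, which is the situation a verification tool is meant to expose.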

Functional correctness and absence of bugs are not the only important aspects
of students' programs. The programs are often supposed to meet certain requirements
concerning the structure of the program, such as its modularity (adequate
decomposition of code into functions) or simplicity. Figure 2 shows two solutions
of different modularity or structural simplicity for two problems. Neither testing
nor software verification can be used to assess these aspects of the programs. This
problem can be addressed by checking the similarity of the student's solution with a
teacher-provided solution, i.e., by analyzing the similarity of their related graphs
(e.g., CFGs) (Wang et al., 2007; Naude et al., 2010; Nikolic, 2013). 3



Problem   First solution              Second solution

1.        if(a<b) n = a;              n = min(a, b);
          else n = b;                 m = min(c, d);
          if(c<d) m = c;
          else m = d;

2.        for(i=0; i<n; i++)          for(i=0; i<n; i++)
            for(j=0; j<n; j++)          m[i][i] = 1;
              if(i==j)
                m[i][j] = 1;

Fig. 2 Examples extracted from two students' solutions of the same problem



Finally, using similarity only (as in Wang et al., 2007; Naude et al., 2010),
or even with the support of a bug finding tool, would fail to penalize incorrect
program behavior. Figure 3 gives a simple example program, extracted from
a real student's solution, that is very similar to the expected solution and has no
verification errors. However, this program is not functionally correct. Therefore,
we conclude that the synergy of these three approaches is needed for sophisticated
evaluation of students' assignments.



  max = 0;                    max = a[0];
  for(i=0; i<n; i++)          for(i=1; i<n; i++)
    if(a[i] > max)              if(a[i] > max)
      max = a[i];                 max = a[i];

Fig. 3 Code extracted from a student's solution (left-hand side) and the expected solution
(right-hand side). In the student's solution there are no verification bugs and it is very similar
to the expected solution, but it does not exhibit the desired behavior (in the case when all elements
of the array a are negative integers).
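
The difference between the two fragments of Figure 3 can be made concrete by wrapping them in functions and running them on the same inputs. The function names student_max and expected_max and the sample arrays are assumptions introduced for this sketch only.

```c
#include <assert.h>

/* Student's fragment from Figure 3: the maximum is initialized to 0. */
int student_max(const int a[], int n)
{
    assert(n > 0);
    int i, max = 0;
    for (i = 0; i < n; i++)
        if (a[i] > max)
            max = a[i];
    return max;
}

/* Expected fragment from Figure 3: the maximum is initialized to a[0]. */
int expected_max(const int a[], int n)
{
    assert(n > 0);
    int i, max = a[0];
    for (i = 1; i < n; i++)
        if (a[i] > max)
            max = a[i];
    return max;
}
```

For the array {3, 9, 4} both functions return 9, while for {-3, -1, -2} the student's version returns 0 and the expected version returns -1. Only a test case with no positive element reveals the difference.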



3 In Figure 2, the second example could also be distinguished by profiling for large inputs, 
because it is quadratic in one case and linear in the other. However, profiling cannot be used 
to assess structural properties in general. 






Milena Vujosevic-Janicic et al. 



4 Grading Setting 

There may be different grading settings depending on the aims of the course and the goals
of the teachers. In the setting used at an introductory course of programming in C (at the
University of Belgrade), exams are taken on computers and students are expected
to write working programs. In order to help students achieve this goal, each
assignment is provided with several test cases which illustrate the desired behavior
of the solution. Students are also provided with sufficient (but limited) time for
developing and testing programs. If a student fails to provide a working program
that gives correct results for the given test cases, his/her solution is not examined
further. Otherwise, the program is tested by additional test cases (unknown to
students) and a certain number of points is given corresponding to the test cases
successfully passed. Only if all these test cases are successfully passed is the program
further manually examined, and it may obtain additional points with respect to
other features of the program (efficiency, modularity, simplicity, absence of bugs,
etc.).

All experiments described in this paper were performed on a corpus of programs
written by students on exams, following the described grading setting. The
corpus consists of 266 solutions to 15 different problems. These problems include
numerical calculations, manipulations with arrays and matrices, manipulations
with strings, and manipulations with data structures. Only programs that passed
all test cases were included in this corpus. These programs are the main target
of our automated evaluation technique, since manual grading was applied only
in this case and we want to explore the potential for completely eliminating manual
grading. These programs obtained 80% of the maximal score (as they passed all
test cases) and an additional potential 20% was given by manual inspection. The
grades are expressed on a scale from 0 to 10. The corpus together with problem
descriptions and the final marks are publicly available. 4



5 Assignment Evaluation and Software Verification 

In this section we show the benefits of using a software verification tool in assignment
evaluation, such as generating useful feedback for students and providing improved
assignment evaluation for teachers.



5.1 Software verification for assignment evaluation 

No software verification tool can report all the bugs in a program without in- 
troducing false alarms (due to the undecidability of the halting problem). False 
alarms (i.e., reported "bugs" that are not real bugs) arise as a consequence of 
approximations that are necessary in modeling of programs. 

The most important approximation concerns loops. Different verification approaches
use various techniques for dealing with loops, ranging from under-approximations
to over-approximations of loops. Under-approximation of loops, as in bounded model
checking techniques (Clarke, 2008), uses a fixed number n for loop unwinding. In this
case, if the code is verified successfully, the original code has no bugs for n or fewer
passes through the loop. However, some bug may remain undiscovered if
the unwinding is performed an insufficient number of times. Over-approximation
of loops can be done by simulation of the first n and last m passes through the loop
(Vujosevic-Janicic & Kuncak, 2012) or by using abstract interpretation techniques
(Cousot & Cousot, 1977). If there are no bugs detected in the over-approximated
code, then the original code has no bugs either. However, in this case a false alarm
can appear after or inside a loop. On the other hand, precise dealing with loops,
as in symbolic execution techniques, can be non-terminating.
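
The limitation of under-approximation can be illustrated with a toy model. The function below is a hypothetical stand-in for a bounded check of the loop shown in its comment, not the behavior of any particular verification tool: it explores only the first `unwind` passes and reports whether an out-of-bounds write is seen among them.

```c
#include <assert.h>

/* Models a bounded check of the loop
 *     for (i = 0; i < n; i++) buf[i] = 0;
 * against a buffer of `size` elements, exploring only the first `unwind`
 * passes. Returns 1 if an out-of-bounds write is seen, 0 otherwise. */
int overflow_found(unsigned n, unsigned size, unsigned unwind)
{
    assert(unwind > 0);
    for (unsigned i = 0; i < n && i < unwind; i++)
        if (i >= size)
            return 1;
    return 0;
}
```

With n = 10 and size = 5, the overflow happens in the sixth pass, so it is missed with an unwinding bound of 3 and found with a bound of 6. When n does not exceed size, no bound produces a report, which reflects that under-approximation does not introduce false alarms.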

False alarms are highly unwelcome in software development, but they are not
critical: the developer can fix the problem or confirm that the reported problem
is not really a bug (and both of these are situations that the developer can
expect and understand). However, false alarms in assignment evaluation are rather
critical and have to be eliminated. For teachers, there should be no false alarms,
because the evaluation process should be as automatic and reliable as possible.
For students, there should be no false alarms because they would be confused if
told that something is a bug when it is not. In order to eliminate false alarms,
a system may be non-terminating or may fail to report some real bugs. In
assignment evaluation, the second choice is more reasonable: the tool has to be
terminating and must not introduce false alarms, even if the price is missing some real
bugs. These requirements make applications of software verification in education
rather specific, and special care has to be taken when these techniques are applied.

5.2 LAV for assignment evaluation 

LAV is a general purpose verification tool and has a number of options that can 
adapt its behavior to the desired context. When running LAV in the assignment 
evaluation context, most of these options can be fixed. 

The most important choice for the user is the way in which
LAV deals with loops. LAV supports both over-approximation of loops
and a fixed number of loop unwindings (under-approximation), two common
techniques for dealing with loops. Setting the upper loop bound (if
under-approximation is used) is problem dependent and should be done by the teacher
for each assignment.

We use LAV in the following way. LAV is first invoked with its default
parameters, i.e., with over-approximation of loops. Since this technique can introduce false
alarms, if a potential bug is found after or inside a loop, the verification is invoked
again, this time with a fixed unwinding parameter. If the bug is still present,
it is reported. Otherwise, the previously detected potential bug is considered
to be a false alarm and is not reported.
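
The two-phase use of the verifier can be summarized by the following decision sketch. All types and function names here are hypothetical and do not correspond to LAV's actual interface; the three small verifier functions are stand-ins that exist only to exercise the decision logic. The sketch also assumes that a potential bug that is neither inside nor after a loop is reported directly, since the loop approximation cannot be its cause.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int unsafe;            /* nonzero if a potential bug was found */
    int in_or_after_loop;  /* nonzero if it is located inside or after a loop */
} verdict;

/* Hypothetical verifier entry point (not LAV's real interface). */
typedef verdict (*verifier_fn)(const char *program, unsigned unwind);

/* Returns 1 if a bug should be reported to the student, 0 otherwise. */
int should_report_bug(verifier_fn over_approx, verifier_fn under_approx,
                      const char *program, unsigned unwind)
{
    assert(over_approx != NULL && under_approx != NULL);
    verdict first = over_approx(program, 0);
    if (!first.unsafe)
        return 0;                       /* verified, nothing to report */
    if (!first.in_or_after_loop)
        return 1;                       /* not affected by loop approximation */
    /* possible false alarm: keep the bug only if it is confirmed */
    return under_approx(program, unwind).unsafe;
}

/* Stand-in verifiers with fixed answers, for illustration only. */
verdict alarm_in_loop(const char *p, unsigned u)
{
    (void)p; (void)u;
    return (verdict){1, 1};
}

verdict alarm_outside_loop(const char *p, unsigned u)
{
    (void)p; (void)u;
    return (verdict){1, 0};
}

verdict no_alarm(const char *p, unsigned u)
{
    (void)p; (void)u;
    return (verdict){0, 0};
}
```

In this sketch, an alarm raised inside a loop by the first phase and not confirmed by the second phase is suppressed, which corresponds to treating it as a false alarm.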

In software verification, each detected bug is important and should be reported.
However, some bugs can confuse novice programmers, like the one shown in Figure
4. In this code, in line 11, there is a possible buffer overflow. For instance, for
n = 0x80000001 only 4 bytes will be allocated for the pointer array, because of
an integer overflow. This is a verification error, but a teacher may decide not to
consider this kind of bug. For this purpose, LAV can be invoked in a mode for
students (so that bugs like this one are not reported).

 1: unsigned i, n;
 2: unsigned *array;
 3: scanf("%u", &n);
 4: array = malloc(n*sizeof(unsigned));
 5: if(array == NULL)
 6: {
 7:   fprintf(stderr, "Unsuccessful allocation\n");
 8:   exit(EXIT_FAILURE);
 9: }
10: for(i=0; i<n; i++)
11:   array[i] = i;

Fig. 4 Buffer overflow in this code is a verification error, but the teacher may decide not to
consider this kind of bug.
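
The arithmetic behind this example can be checked with a short sketch. It assumes, as the example in the text implicitly does, that both unsigned and the allocation size are 32 bits wide; the function names alloc_size_32, alloc_checked and check_worked_example are introduced here for illustration. The second function shows one possible guard against the overflow and is not part of the student's code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Value of n * sizeof(unsigned) when the size type and unsigned are both
 * 32 bits wide: only the low 32 bits of the product are kept. */
uint32_t alloc_size_32(uint32_t n)
{
    return (uint32_t)((uint64_t)n * 4u);
}

/* 0x80000001 * 4 = 0x200000004, which wraps around to 4 bytes. */
void check_worked_example(void)
{
    assert(alloc_size_32(UINT32_C(0x80000001)) == 4u);
}

/* Allocation with an explicit overflow check: returns NULL if
 * n * sizeof(unsigned) would not fit into size_t. */
unsigned *alloc_checked(size_t n)
{
    if (n > SIZE_MAX / sizeof(unsigned))
        return NULL;
    return malloc(n * sizeof(unsigned));
}
```

With such a check, a request that would wrap around is rejected in the same way as a failed allocation, so the loop in lines 10 and 11 is never reached with a buffer that is too small.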

To a limited extent, LAV has already been used on students' assignments in an
introductory programming course (Vujosevic-Janicic & Kuncak, 2012). In these
experiments, most of the programs from the corpus were not functionally correct.
It was shown that the vast majority of bugs produced by students follow from wrong
expectations, for instance, expectations that the input parameters of their programs
will meet certain constraints and that memory allocation will always succeed. It was
also noticed that most of the reported bugs are a consequence of only a few oversights.
In many cases, the omission of a necessary check produces several bugs in the rest of
the program. Therefore, the number of bugs, as reported by a verification tool,
is not a reliable indicator of program quality. This property will be taken into
account in automated grading.



5.3 Experimental evaluation 

As discussed in Section 3, programs that successfully pass a testing phase can still 
contain bugs. To show that this problem is practically important, we used LAV to 
analyze programs from the corpus described in Section 4. 

For each problem, LAV was run with its default parameters, and programs
with potential bugs were checked with under-approximation of loops, as described
in Section 5.2. 5 The results are shown in Table 1. The time that LAV spent
analyzing the programs was typically negligible. 6 LAV discovered bugs in 35
solutions that successfully passed the testing. There was one bug missed by manual
inspection and detected by LAV, and one bug missed by LAV and detected by
manual inspection. The bug missed by manual inspection was the one described
in Section 3 and given in Figure 1. The bug missed by LAV was a consequence of
the problem formulation, which was too general to allow a precise unique upper
loop unwinding parameter value for all possible solutions. There were just two
false alarms produced by LAV when the default parameters were used. These false
alarms were eliminated when the tool was invoked for the second time with a
specified loop unwinding parameter, and hence there were no false alarms in the final
outputs. In summary, the presented results show that a verification tool like LAV
can be used as a complement to automated testing that improves the evaluation
process.

5 When analyzing the solutions of problems 3, 5 and 8, only under-approximation of loops
was used. This was a consequence of the formulation of the problems given to the students.
Namely, the formulation of these problems contained some assumptions on input parameters.
These assumptions implied that some potential bugs should not be considered (because these
are not bugs when these additional assumptions are taken into account).

6 Generally, in this context, a time limit can be given to the verification tool; if it is
exceeded, no bug is reported (in order to avoid reporting false alarms), or the program can
be checked using the same parameters but with another underlying solver (if applicable for
the tool).

Table 1 Summary of bugs in the corpus: the second column represents the number of students'
solutions to the given problem; the third and the fourth columns represent the number of
solutions with bugs detected by manual inspection and by LAV; the fifth column gives the
number of programs shown to be bug-free by LAV (over/under approximation); the sixth
column gives the number of false alarms made by LAV invoked with default parameters and,
if applicable, with under-approximation.

problem   # solutions   # programs with bugs    # programs with bugs   # bug-free programs     # false alarms with
                        by manual inspection    by LAV                 by LAV (def./custom)    def./custom parameters

1.        44            0                       0                      44/-                    0/-
2.        32            11                      11                     20/1                    1/0
3.        7             2                       2                      -/5                     -/0
4.        5             0                       1                      3/1                     1/0
5.        12            3                       2                      -/10                    -/0
6.        7             0                       0                      6/1                     1/0
7.        33            0                       0                      33/-                    0/-
8.        31            11                      11                     -/20                    -/0
9.        10            6                       6                      4/0                     0/0
10.       14            2                       2                      12/0                    0/0
11.       31            0                       0                      31/-                    0/-
12.       18            0                       0                      18/-                    0/-
13.       3             0                       0                      3/-                     0/-
14.       7             0                       0                      7/-                     0/-
15.       12            0                       0                      12/-                    0/-
total     266           35                      35                     193/38                  2/0


5.4 Feedback for students and teachers 

LAV can be used to provide a meaningful and comprehensible feedback to students 
while writing their programs. Information like the line number, the kind of the 
error, program trace that introduces the error and values of the variables along 
this trace, can help student improve the solution. It can also remind the student to 
add an appropriate check that is missing. The example given in Figure 5, extracted 
from a student's code written on an exam, shows the error detected by LAV and 
the generated hint. 

From the software verification support, a teacher can learn whether the 
student's program contains a bug. The teacher can use this information when 
grading assignments manually. Alternatively, this information can be taken 
into account within a wider integrated framework that produces an automatically 
proposed final grade, as discussed in Section 7. 

     1: #include <stdio.h>
     2: #include <stdlib.h>
     3: int get_digit(int n, int d);
     4: int main(int argc, char** argv)
     5: {
     6:     int n, d;
     7:     n = atoi(argv[1]);
     8:     d = atoi(argv[2]);
     9:     printf("%d\n", get_digit(n, d));
    10:     return 0;
    11: }

    verification failed:
    line 7: UNSAFE
    function: main
    error: buffer_overflow
    in line 7: counterexample:
    argc == 1, argv == 1

    HINT:
    A buffer overflow error occurs when
    trying to read or write outside the
    reserved memory for a buffer/array.
    Check the boundaries of the array!

Fig. 5 Listing extracted from a student's code written on an exam (top) and LAV's 
output (bottom) 
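
The hint in Figure 5 asks for a bounds check before argv[1] and argv[2] are read. One possible form of that check is sketched below. The helper name read_two_ints and its return convention are illustrative assumptions; they are not part of the student's program or of LAV's output.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the student's code): reads two integers
 * from the command line. Returns 1 on success and 0 when fewer than two
 * arguments were supplied, so argv[1] and argv[2] are only read when
 * they are valid argument strings. */
int read_two_ints(int argc, char **argv, int *n, int *d)
{
    if (argc < 3) {
        return 0;   /* argv[1] or argv[2] is not a valid argument */
    }
    *n = atoi(argv[1]);
    *d = atoi(argv[2]);
    return 1;
}
```

In main, lines 7 and 8 of the listing could then be replaced by a single guarded call that returns a nonzero exit code when read_two_ints reports failure.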

6 Assignment Evaluation and Structural Similarity of Programs 

In this section we propose a similarity measure for programs based on their control 
flow graphs, perform its experimental evaluation, and point to ways it can be used 
to provide feedback for students and teachers. 

6.1 Similarity of CFGs for assignment evaluation 

To evaluate structural properties of programs, we compare students' programs 
to solutions provided by the teacher. A student's program is considered to be 
good if it is similar to some of the programs provided by the teacher 
(Wang et al., 2007). To perform the comparison, a suitable program 
representation and a similarity measure are needed. As already noted in Section 
2, each program has a corresponding control flow graph (CFG), which reflects 
the structure of the program. In addition, each node of the CFG has an 
associated linear code sequence, which we call the node content. We assume that 
the code is in the intermediate LLVM language. To measure the similarity 
of programs, both the similarity of the graphs' structures and the similarity of node 
contents should be considered. We combine the similarity of node contents 
with the topological similarity of graph nodes described in Section 2. 

Similarity of node contents. The node content is a sequence of LLVM instructions. A 
simple way of measuring the similarity of two sequences of instructions $s_1$ and $s_2$ is 
to use the edit distance between them, $d(s_1, s_2)$: the minimal number of insertion, 
deletion and substitution operations over the elements of the sequence by which 
one sequence can be transformed into the other (Levenshtein, 1966). For the 
edit distance to be computed, the cost of each insertion, deletion and substitution 
operation has to be defined. We define the cost of insertion and deletion of an 
instruction to be 1. Next, we define the cost of substitution of instruction $i_1$ by 
instruction $i_2$. Let $opcode$ be a function that maps an instruction to its opcode (the 
part of an instruction that specifies the operation to be performed). If $opcode(i_1)$ 
and $opcode(i_2)$ are both function calls, the cost of substitution is 1 if $i_1$ and $i_2$ call 
different functions, and 0 if they call the same function. If $opcode(i_1)$ or $opcode(i_2)$ 
is not a function call, the cost of substitution is 1 if $opcode(i_1) \neq opcode(i_2)$, and 
0 otherwise. Let $n_1 = |s_1|$, $n_2 = |s_2|$, and let $M$ be the maximal edit distance over 
two sequences of lengths $n_1$ and $n_2$. Then, the similarity of sequences $s_1$ and $s_2$ is 
defined as $1 - d(s_1, s_2)/M$. 

Although it could be argued that the proposed similarity measure is rough 
since it does not account for differences of instruction arguments, it is simple, 
easily implemented, and intuitive. 
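
The node content similarity described above can be sketched in C as follows. This is a minimal illustration and not the authors' implementation: the Instr struct is a simplified stand-in for an LLVM instruction, and the sketch takes M = max(n1, n2), which is the largest possible edit distance when every operation costs at most 1. The choice to treat two empty sequences as fully similar is also an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an LLVM instruction (an assumption of this
 * sketch): the opcode name and, for call instructions only, the name
 * of the called function. */
typedef struct {
    const char *opcode;
    const char *callee;   /* NULL unless the instruction is a call */
} Instr;

/* Substitution cost: two calls are compared by callee,
 * everything else by opcode. The result is 0 or 1. */
static int subst_cost(const Instr *a, const Instr *b)
{
    if (a->callee != NULL && b->callee != NULL) {
        return strcmp(a->callee, b->callee) == 0 ? 0 : 1;
    }
    return strcmp(a->opcode, b->opcode) == 0 ? 0 : 1;
}

static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Edit distance with unit insertion and deletion costs, computed with
 * two rows of the dynamic programming table.
 * Returns (size_t)-1 if memory cannot be allocated. */
size_t edit_distance(const Instr *s1, size_t n1, const Instr *s2, size_t n2)
{
    size_t *prev = malloc((n2 + 1) * sizeof *prev);
    size_t *curr = malloc((n2 + 1) * sizeof *curr);
    size_t i, j, result;

    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }
    for (j = 0; j <= n2; j++) prev[j] = j;
    for (i = 1; i <= n1; i++) {
        size_t *tmp;
        curr[0] = i;
        for (j = 1; j <= n2; j++) {
            curr[j] = min3(prev[j] + 1,          /* deletion */
                           curr[j - 1] + 1,      /* insertion */
                           prev[j - 1] + (size_t)subst_cost(&s1[i - 1], &s2[j - 1]));
        }
        tmp = prev; prev = curr; curr = tmp;     /* reuse the two rows */
    }
    result = prev[n2];
    free(prev);
    free(curr);
    return result;
}

/* Similarity 1 - d(s1, s2) / M with M = max(n1, n2).
 * Returns -1.0 if the distance could not be computed. */
double sequence_similarity(const Instr *s1, size_t n1, const Instr *s2, size_t n2)
{
    size_t m = n1 > n2 ? n1 : n2;
    size_t d;
    if (m == 0) return 1.0;   /* two empty sequences (assumed identical) */
    d = edit_distance(s1, n1, s2, n2);
    if (d == (size_t)-1) return -1.0;
    return 1.0 - (double)d / (double)m;
}
```

For two three-instruction nodes that differ only in the function they call, the distance is 1 and the similarity is 2/3.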

Full similarity of nodes and similarity of CFGs. The topological similarity of nodes 
can be computed by the method described in Section 2. However, purely 
topological similarity does not account for differences in node content. Hence, we 
modify the computation of topological similarity to include the a priori similarity 
of nodes. The modified update rule is: 

$$x_{ij}^{k+1} \leftarrow \sqrt{y_{ij} \cdot \frac{s_{in}^{k+1}(i,j) + s_{out}^{k+1}(i,j)}{2}}$$

where $y_{ij}$ are the similarities of contents of nodes $i$ and $j$, and $s_{in}^{k+1}(i,j)$ and 
$s_{out}^{k+1}(i,j)$ are defined by Equations 1. Also, we set $x_{ij}^{0} = y_{ij}$. This way, both 
content similarity and topological similarity of nodes are taken into consideration. 
The similarity of CFGs can be defined based on the node similarity matrix as 
described in Section 2. Note that both the similarity of nodes and the similarity 
of CFGs take values in the interval [0, 1]. 
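
As a small worked illustration, assume that the modified rule takes the square root of the content similarity multiplied by the average of the incoming and outgoing neighbor similarities. The numbers below are made up for the example and do not come from the corpus.

```latex
% Hypothetical values for one node pair (i, j):
%   y_{ij} = 0.64, s_{in}^{k+1}(i,j) = 0.9, s_{out}^{k+1}(i,j) = 0.7
\[
x_{ij}^{k+1}
  = \sqrt{0.64 \cdot \frac{0.9 + 0.7}{2}}
  = \sqrt{0.64 \cdot 0.8}
  = \sqrt{0.512}
  \approx 0.716
\]
% If y_{ij} = 0, then x_{ij}^{k+1} = 0 whatever the neighbor similarities are.
```

Under this reading, two nodes with completely different contents keep similarity 0 regardless of how well their neighborhoods match.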

It should be noted that our approach provides both a similarity measure for 
CFGs and a similarity measure for their nodes. In addition to evaluating the 
similarity of programs, this approach enables matching of related parts of the 
programs by matching the most similar nodes of their CFGs. This could serve as 
the basis of a method for suggesting which parts of the student's program could be 
further improved. 



6.2 Experimental evaluation 

In order to show that the proposed program similarity measure corresponds to 
an intuitive notion of program similarity, we performed the following experiment. 
For each program from the corpus already described in Section 4, we found 
the most similar program in the rest of the corpus and counted how often the two 
programs are solutions to the same problem. That was the case for 90% of all 
programs. This shows that our similarity measure performs well: with high 
probability, the program most similar to a given program is a solution to the 
same problem. Inspection suggests that in most cases where 
the programs do not correspond to the same problem, the student took an original 
approach to solving the problem. 
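
The experiment above amounts to a nearest-neighbor check over a precomputed similarity matrix, which can be sketched as follows. The function name, the row-major matrix layout, and the tie-breaking rule (the first maximum wins) are assumptions made for this illustration.

```c
#include <stddef.h>

/* sim is an n-by-n similarity matrix stored row by row, and problem[i]
 * is the identifier of the problem that program i solves.
 * Returns the fraction of programs whose most similar other program
 * solves the same problem. Ties are broken in favor of the lower index. */
double same_problem_fraction(const double *sim, const int *problem, size_t n)
{
    size_t i, j, hits = 0;

    if (n < 2) return 0.0;   /* no "rest of the corpus" to compare with */
    for (i = 0; i < n; i++) {
        size_t best = (i == 0) ? 1 : 0;   /* first candidate other than i */
        for (j = 0; j < n; j++) {
            if (j != i && sim[i * n + j] > sim[i * n + best]) best = j;
        }
        if (problem[best] == problem[i]) hits++;
    }
    return (double)hits / (double)n;
}
```

On a corpus like the one described, a return value of about 0.9 would correspond to the reported 90%.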

The CFGs of the programs from the corpus are rather small; their average size 
is 15 nodes. The time spent computing the similarity of two programs 
is negligible. However, outside the educational context, where CFGs could have 
thousands of nodes, scalability might be an issue. 






6.3 Feedback for students and teachers 

Students can benefit from program similarity evaluation while learning and 
practicing, provided that the teacher has supplied a valid solution or set of solutions 
to the evaluation system. In introductory programming courses, a student's 
solution can most often be considered better if it is more similar to one of the 
teacher's solutions (Wang et al., 2007). In Section 7 we show that the similarity 
measure can be used for automatic calculation of a grade (feedback that students 
easily understand). Moreover, we show that there is a significant linear dependence 
of the grade on the similarity value. Due to that linearity, the similarity value can 
be considered intuitive feedback in itself, but it can also be translated into a 
descriptive estimate. For example, the feedback could be that the solution is dissimilar 
(0-0.5), roughly similar (0.5-0.7), similar (0.7-0.9) or very similar (0.9-1) to one of 
the desired solutions. 
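
The descriptive estimate could be produced by a simple threshold function such as the sketch below. The function name is hypothetical, and the text does not say to which band a boundary value such as 0.5 belongs; this sketch assigns it to the higher band.

```c
/* Maps a similarity value in [0, 1] to the descriptive estimate
 * suggested in the text. Boundary values go to the higher band,
 * which is an assumption of this sketch. */
const char *similarity_label(double s)
{
    if (s < 0.5) return "dissimilar";
    if (s < 0.7) return "roughly similar";
    if (s < 0.9) return "similar";
    return "very similar";
}
```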

Teachers can use the similarity information in automated grading, as 
discussed in Section 7. 

7 Automated Grading 

We believe that automated grading can be performed by calculating a linear 
combination of different scores measured for the student's solution. We propose a linear 
model for prediction of the teacher-provided grade of the following form: 

$$\hat{y} = \alpha_1 \cdot x_1 + \alpha_2 \cdot x_2 + \alpha_3 \cdot x_3$$

where 

- $\hat{y}$ is the automatically predicted grade, 

- $x_1$ is the result obtained by automated testing, expressed in the interval [0, 1], 

- $x_2$ is 1 if the student's solution is reported correct by the software 
verification tool, and 0 otherwise, 

- $x_3$ is the maximal value of similarity between the student's solution and each 
of the teacher-provided solutions (its range is [0, 1]). 
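
The linear model can be written as a short function. The Weights struct and the function name are illustrative assumptions, and the coefficient values are left to the caller.

```c
/* Coefficients of the linear grading model (names are illustrative). */
typedef struct {
    double a1;   /* weight of the testing score x1 */
    double a2;   /* weight of the verification flag x2 */
    double a3;   /* weight of the similarity score x3 */
} Weights;

/* Predicted grade: a1*x1 + a2*x2 + a3*x3, where x1 and x3 lie in [0, 1]
 * and x2 is 1 if the verifier reported the solution correct, else 0. */
double predict_grade(Weights w, double x1, int verified, double x3)
{
    double x2 = verified ? 1.0 : 0.0;
    return w.a1 * x1 + w.a2 * x2 + w.a3 * x3;
}
```

For instance, with hypothetical weights (8, 1, 1), a solution that passes all tests, is reported correct, and has similarity 0.5 would be assigned 9.5.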

It should be noted that we do not use bug count as a parameter, as discussed 
in Section 5.2. Different choices for the coefficients $\alpha_i$, for $i = 1, 2, 3$, could be 
proposed. In our case, one simple choice could be $\alpha_1 = 8$, $\alpha_2 = 1$, and $\alpha_3 = 1$, since 
all programs in our training set won 80% of the full grade due to their success in 
testing. However, it is not always clear how the teacher's intuitive grading criterion 
can be factored into automatically measurable quantities. Teachers need not have 
an intuitive feeling for all the variables involved in the grading. For instance, 
the behavior of any of the proposed similarity measures, including ours (Wang 
et al., 2007; Naude et al., 2010; Nikolic, 2013), is not clear from their definitions 
alone. So, it may be unclear how to choose weights for different variables when 
combining them in the final grade, or whether some of the variables should be nonlinearly 
transformed in order to be useful for grading. A natural solution is to tune 
the coefficients $\alpha_i$, for $i = 1, 2, 3$, so that the behavior of the predictive model 
corresponds to the teacher's grading style. For that purpose, the coefficients can be 
determined automatically using least squares linear regression (Gross, 2003) if a 
manually graded corpus of students' programs is provided by the teacher. 

In our evaluation the corpus of programs was split into a training set and a test 
set, where the training set consisted of two thirds of the corpus and the test set 
of the remaining third. The training set contained solutions of eight 
different problems and the test set contained solutions of the remaining seven 
problems. 

Due to the nature of the corpus, $x_1 = 1$ holds for all instances. Therefore, 
while it is clear that the number of test cases the program passed ($x_1$) is useful in 
automated grading, this variable cannot be analyzed based on this corpus. 

The optimal values of the coefficients $\alpha_i$, $i = 1, 2, 3$, with respect to the training 
corpus, are determined using least squares linear regression. The obtained equation 
is 

$$\hat{y} = 6.058 \cdot x_1 + 1.014 \cdot x_2 + 2.919 \cdot x_3$$
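
A least squares fit of three coefficients can be sketched as follows. This is an illustration under stated assumptions rather than the procedure the authors ran: it forms the 3x3 normal equations and solves them by Gaussian elimination with partial pivoting, and the function name, the row-major data layout, and the singularity threshold are all choices made for the sketch.

```c
#include <stddef.h>

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Fits y ~ a[0]*x1 + a[1]*x2 + a[2]*x3 by ordinary least squares.
 * x holds n rows of three values (x1, x2, x3), stored row by row.
 * Solves the normal equations (X^T X) a = X^T y.
 * Returns 0 on success and -1 if the system is numerically singular,
 * for example when two columns of x are identical. */
int fit_weights(const double *x, const double *y, size_t n, double a[3])
{
    double m[3][4] = {{0.0}};   /* augmented matrix [X^T X | X^T y] */
    size_t k;
    int i, j, col;

    for (k = 0; k < n; k++) {
        const double *row = x + 3 * k;
        for (i = 0; i < 3; i++) {
            for (j = 0; j < 3; j++) m[i][j] += row[i] * row[j];
            m[i][3] += row[i] * y[k];
        }
    }

    /* Forward elimination with partial pivoting. */
    for (col = 0; col < 3; col++) {
        int pivot = col;
        for (i = col + 1; i < 3; i++) {
            if (absd(m[i][col]) > absd(m[pivot][col])) pivot = i;
        }
        if (absd(m[pivot][col]) < 1e-12) return -1;
        for (j = 0; j < 4; j++) {
            double t = m[col][j];
            m[col][j] = m[pivot][j];
            m[pivot][j] = t;
        }
        for (i = col + 1; i < 3; i++) {
            double f = m[i][col] / m[col][col];
            for (j = col; j < 4; j++) m[i][j] -= f * m[col][j];
        }
    }

    /* Back substitution. */
    for (i = 2; i >= 0; i--) {
        double s = m[i][3];
        for (j = i + 1; j < 3; j++) s -= m[i][j] * a[j];
        a[i] = s / m[i][i];
    }
    return 0;
}
```

When x1 equals 1 for every instance, as in the corpus described here, the first coefficient plays the role of an intercept. The fit then fails with -1 if x2 or x3 is also constant across the training set.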

The formula for $\hat{y}$ may seem counterintuitive. Since the minimal grade in the 
corpus is 8 and $x_1 = 1$ for all instances, one would expect that $\alpha_1 \approx 8$. 
The discrepancy is due to the fact that, for the solutions in the corpus, the minimal 
value of $x_3$ is 0.68: since the solutions are good (they all passed the testing), 
there are no programs with a low similarity value. Taking this into consideration, 
one can rewrite the formula for $\hat{y}$ as 

$$\hat{y} = 8.043 \cdot x_1 + 1.014 \cdot x_2 + 0.934 \cdot x'_3$$

where $x'_3 = \frac{x_3 - 0.68}{1 - 0.68}$, so the variable $x'_3$ takes values from the interval [0, 1]. This 
means that when the ranges of variability of both $x_2$ and $x_3$ are scaled to the interval 
[0, 1], their contributions to the mark are rather similar. 
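
The rewritten coefficients can be checked by hand. The derivation below assumes that the rescaled variable is the linear map sending [0.68, 1] to [0, 1], and it uses the fact that $x_1 = 1$ for every instance in the corpus, so a constant can be folded into the first coefficient.

```latex
% Assumed rescaling: x'_3 = (x_3 - 0.68) / (1 - 0.68), i.e. x_3 = 0.68 + 0.32 x'_3
\begin{align*}
2.919 \cdot x_3
  &= 2.919 \cdot 0.68 + 2.919 \cdot 0.32 \cdot x'_3
   \approx 1.985 + 0.934 \cdot x'_3 \\
\hat{y}
  &= 6.058 \cdot x_1 + 1.014 \cdot x_2 + 2.919 \cdot x_3 \\
  &\approx (6.058 + 1.985) \cdot x_1 + 1.014 \cdot x_2 + 0.934 \cdot x'_3
   \qquad \text{(using } x_1 = 1\text{)} \\
  &= 8.043 \cdot x_1 + 1.014 \cdot x_2 + 0.934 \cdot x'_3
\end{align*}
```

The first coefficient is then close to the minimal grade of 8, as expected.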

Table 2 shows the comparison between the model $\hat{y}$ and three other models. 
Model $\hat{y}_1 = 8 \cdot x_1 + x_2 + x_3$ has predetermined parameters, model $\hat{y}_2$ is trained 
with verification information $x_2$ only (without the similarity measure), and model $\hat{y}_3$ is 
trained with the similarity measure $x_3$ only (without verification information). The results 
show that model $\hat{y}$ performs very well on the test set (consisting of problems not 
appearing in the training set): the correlation is 0.842 and the 
model accounts for 71% of the variability of the teacher-provided grade. These results 
indicate a strong and reliable dependence between the teacher-provided grade and the 
variables $x_i$, meaning that a grade can be reliably predicted by $\hat{y}$. Also, $\hat{y}$ is much 
better than the other models, whose correlations in Table 2 range from 0.457 to 0.730. 
This shows that the approach using both verification 
information and graph similarity information is superior to approaches using only 
one source of information, and also that automated tuning of the coefficients of the 
model provides better prediction than fixing them in advance. 
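
The quantities reported in Table 2 can be computed from predicted and teacher-provided grades as sketched below. Two assumptions are made: the "average error" is read as the mean absolute error, and the sketch returns the squared correlation directly, so that no math library call is needed. The correlation coefficient itself is the square root of that value, taken with the sign of the covariance.

```c
#include <stddef.h>

/* Squared Pearson correlation between predicted and teacher grades.
 * Multiplying the result by 100 gives the percentage of variance
 * accounted for by the model. Returns 0.0 when either series is
 * constant, a convention chosen for this sketch. */
double r_squared(const double *pred, const double *actual, size_t n)
{
    double mp = 0.0, ma = 0.0, spp = 0.0, saa = 0.0, spa = 0.0;
    size_t i;

    if (n == 0) return 0.0;
    for (i = 0; i < n; i++) {
        mp += pred[i];
        ma += actual[i];
    }
    mp /= (double)n;
    ma /= (double)n;
    for (i = 0; i < n; i++) {
        double dp = pred[i] - mp;
        double da = actual[i] - ma;
        spp += dp * dp;
        saa += da * da;
        spa += dp * da;
    }
    if (spp == 0.0 || saa == 0.0) return 0.0;
    return (spa * spa) / (spp * saa);
}

/* Mean absolute error divided by the width of the grade range,
 * for example 10 - 8 = 2 for the corpus described in the text. */
double relative_error(const double *pred, const double *actual, size_t n,
                      double min_grade, double max_grade)
{
    double sum = 0.0;
    size_t i;

    if (n == 0 || max_grade <= min_grade) return 0.0;
    for (i = 0; i < n; i++) {
        double e = pred[i] - actual[i];
        sum += e < 0.0 ? -e : e;
    }
    return (sum / (double)n) / (max_grade - min_grade);
}
```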

Inspection of the solutions that yielded the largest prediction errors suggests 
that the greatest sources of discrepancy between automatically provided and 
teacher-provided grades are the original solutions given by students and the solutions that 
the teacher did not predict in advance. However, we cannot exclude other factors, 
apart from the presence of bugs and the similarity to model solutions, that govern the human 
grading process. 

model          $r$      $r^2 \cdot 100\%$   Rel. error
-----------    -----    -----------------   ----------
$\hat{y}$      0.842    71%                 10.1%
$\hat{y}_1$    0.730    53.3%               12.8%
$\hat{y}_2$    0.620    38.4%               16.7%
$\hat{y}_3$    0.457    20.9%               17.7%

Table 2 The performance of the predictive model on the training and test set. We provide 
the correlation coefficient ($r$), the fraction of variance of $y$ accounted for by the model 
($100 \cdot r^2$), and the relative error, i.e. the average error divided by the length of the 
range in which the grades vary (which is 8 to 10 in the case of this particular corpus). 

8 Related work 

Automated testing is the most common way of evaluating students' programs 
(Douce et al., 2005). Test cases are usually supplied by a teacher and/or 
randomly generated (Mandal et al., 2007). Many systems use this approach, for 
example, PSGE (Hext & Winings, 1969), Kassandra (Matt, 1994), BOSS (Joy, 
Griffiths, & Boyatt, 2005), WebToTeach (Arnow & Barshay, 1999), Scheme-robo 
(Saikkonen, Malmi, & Korhonen, 2001), TRY (Jones, 2001), HoGG (Morris, 2002), 
BAGS (Morris, 2003), Online Judge (Cheang, Kurnia, Lim, & Oon, 2003), JEWL 
(English, 2004), Quiver (Ellsworth, Fenwick, & Kurtz, 2004), and JUnit (Wick, 
Stevenson, & Wagner, 2005). 

Software verification techniques are not commonly used in automated 
evaluation of programs. There are limited experiments on using the Java PathFinder model 
checker for automated test case generation (Ihantola, 2007). Tools with integrated 
support for automated testing and verification, e.g. CAESAR (Garavel, 1998), are 
usually too complex and not aimed at educational use. To the authors' 
knowledge, no other software verification tool has been deployed for 
automated bug finding as a complement to automated testing of students' programs. 
The tool LAV was already used, to a limited extent, for finding bugs in students' 
programs (Vujosevic-Janicic & Kuncak, 2012). In that work, a different sort of 
corpus was used, as discussed in Section 5.2. Also, that application did not aim 
at automated grading; instead, it was made in the wider context of the design and 
development of LAV as a general-purpose SMT-based error finding platform. 

Wang et al. proposed a grading approach for assignments in C based only on 
program similarity (Wang et al., 2007). It relies on dependence graphs (Horwitz & 
Reps, 1992) as the program representation. They perform various code transformations 
in order to standardize the representation of the program. In this approach, the 
similarity is calculated based on a comparison of structure, statements, and size, which 
are weighted by predetermined coefficients. Their approach was evaluated on 
10 problems with 200 solutions each, and obtained good results compared to manual 
grading. Manual grading was performed strictly according to a criterion that 
specifies how scores are awarded for structure, statements used, and size. 
However, it is not obvious that human grading is always expressed strictly 
in terms of these three factors. An advantage of our approach compared to this 
one is the automated tuning of weights corresponding to the different variables used in 
grading, instead of using predetermined ones. Since teachers need not 
have an intuitive feeling for different similarity measures, it may be unclear how 
the corresponding weights should be chosen. Also, we avoid language-dependent 
transformations by using LLVM, which makes our approach applicable to a large 
variety of programming languages. An approach very similar to that of Wang et 
al. was presented by Li et al. (Li, Pan, Zhang, Chen, Nie, & He, 2010). 

Another approach to grading assignments based only on a graph similarity 
measure was proposed by Naude et al. (Naude et al., 2010). They represent programs 
as dependence graphs and propose a directed acyclic graph (DAG) similarity 
measure. In their approach, for each solution to be graded, several similar solutions in 
the training set are found, and the grade is formed by combining the grades of these 
solutions with respect to the matched portions of the similar solutions. The approach 
was evaluated on one assignment problem, and the correlation between human and 
machine-provided grades is the same as ours. For appropriate grading they 
recommend at least 20 manually graded solutions of various qualities for each problem 
to be automatically graded. In the case of automatic grading of high-quality 
solutions (as is the case with our corpus), using 20 manually graded solutions, their 
approach achieves 16.7% relative error, while with 90 manually graded solutions 
it achieves around 10%. The improvement that our approach provides is reflected 
in several indicators. We used a heterogeneous corpus of 15 problems instead 
of one. Our approach uses 1 to 3 model solutions for each problem to be graded and 
a training set for weight estimation, which does not need to contain solutions 
for the problem to be graded. So, after the initial training has been performed, 
only a few model solutions need to be provided for each new problem. Using 1 to 3 model 
solutions, we achieve 10% relative error (see Table 2). Due to the use of the LLVM 
platform, we do not use language-dependent transformations, so our approach is 
applicable to a large number of programming languages. The similarity measure we 
use, called neighbor matching, is similar to that of Naude et al., but for our 
measure important theoretical properties (e.g. convergence) are proven (Nikolic, 
2013). The neighbor matching method had already been applied to several problems, 
but in all these applications its use was limited to ordinary graphs whose nodes 
have no internal structure. In order to be applied to CFGs, the method was 
modified to include node content similarity, which was independently defined as 
described in Section 6.1. 

Finally, a distinctive feature of our system is that it is open source: we are not 
aware of open source implementations of the other similarity-based approaches. A 
drawback in comparing our approach to the previously described ones is that our 
corpus consists of high-quality solutions, due to the grading setting of the course. 

Apart from assignment grading, regression techniques have also been used for final 
grade forecasting, with good results. For this purpose, Macfadyen et al. used data 
from a learning management system and identified the variables most useful for the 
prediction, e.g., the number of assessments completed and the number of discussion and 
mail messages sent (Macfadyen & Dawson, 2010). Kotsiantis performed successful 
forecasting based on demographic characteristics of students, results of several 
written assignments, and class attendance (Kotsiantis, 2012). 

9 Conclusions and Further Work 

We presented two techniques that can be used for improving automated 
evaluation of students' programs. The first is based on software verification and the 
second on CFG similarity measurement. Both techniques can be used for providing 
useful feedback to students and for improving automated grading for 
teachers. In our evaluation, we showed that the combination of these techniques offers more 
information useful for automated grading than either of them independently. Also, 
we obtained good results in predicting the grades for a new set of assignments. 
This shows that our approach can be trained to adapt to a teacher's grading style on 
several teacher-graded problems and then be used on different problems using only 
a few model solutions per problem. An important advantage of our approach is its 
independence of a specific programming language, since the LLVM platform (which we use 
to produce intermediate code) supports a large number of programming languages. 
We also provide the corresponding open source tools. 

In our future work we plan to build an integrated web-based system 
with support for the mentioned techniques along with compiling, automated 
testing, profiling and plagiarism detection for students' programs. Also, we intend 
to improve feedback to students by indicating missing or redundant parts of code 
compared to the teacher's solution. This feature would rely on the fact that our 
similarity measure provides similarity values for the nodes of CFGs, and hence 
enables matching parts of code between two solutions. If some parts of the 
solutions cannot be matched or are matched with very low similarity, this can be 
reported to the student. On the other hand, the similarity of a CFG with itself 
could reveal repetitions of parts of the code and suggest that refactoring could 
be performed. We also plan to integrate the LLVM-based open source tool KLEE 
(Cadar et al., 2008) for automated test case generation and to add support 
for teacher-supplied test cases. 

We are also planning to explore the potential of using software verification tools 
for proving functional correctness of students' programs. This task would pose new 
challenges. Testing, profiling, bug finding and similarity measurement are applied to 
the original students' programs, which makes automation easy. For verification 
of functional correctness, the teacher would have to define correctness conditions 
(possibly in terms of implemented functions) and insert corresponding assertions 
in appropriate places in students' programs. This should be possible to automate 
in some cases, but it is not trivial in general. In addition, for some programs it is 
not easy to formulate correctness conditions (for example, for programs that are 
expected only to print some messages on standard output). 